{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "15690d6f-6515-4b65-9a8f-b907f90711ea",
   "metadata": {},
   "source": [
    "## Lesson 6: Semantic Search, Building a Q&A System"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f6eba99-d01c-469d-9d88-a4d73cbf12cb",
   "metadata": {},
   "source": [
    "#### Project environment setup\n",
    "\n",
    "- Load credentials and relevant Python Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "600b34d4-4b4a-46f9-a0d6-9a312a09d1d0",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "from utils import authenticate\n",
    "credentials, PROJECT_ID = authenticate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc1b1938-32fb-4e32-ac23-35ce13cbd62f",
   "metadata": {},
   "source": [
    "#### Enter project details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a33f57d4-096e-448b-aadf-d208deb84696",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6406271-fa6a-4293-aa61-bf0febf855af",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "import vertexai\n",
    "vertexai.init(project=PROJECT_ID, location=REGION, credentials = credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fe50e74-21cc-4cb8-9595-7ad2df27af06",
   "metadata": {},
   "source": [
    "## Load Stack Overflow questions and answers from BigQuery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "300058f6-3372-47fd-a4f5-63e92f7af612",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeb1e4da-a583-493b-8f6b-e30e69281e24",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "so_database = pd.read_csv('so_database_app.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c31de31c-ca30-4a11-8fbc-8f179cde839a",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "print(\"Shape: \" + str(so_database.shape))\n",
    "print(so_database)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "923834d6-e065-44c5-9032-0af3d032dd74",
   "metadata": {},
   "source": [
    "## Load the question embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c31d02d1-632d-4139-bc31-d1dd4c760f26",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from vertexai.language_models import TextEmbeddingModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fcae98f-3d58-40dc-9244-65b084ce2910",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "embedding_model = TextEmbeddingModel.from_pretrained(\n",
    "    \"textembedding-gecko@001\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7cb7c15-1f60-4ae1-9eee-cacf91ce40be",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from utils import encode_text_to_embedding_batched"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fc34bc0-f2ec-4445-a3c8-2c49957ad167",
   "metadata": {},
   "source": [
    "- Here is the code that embeds the text.  You can adapt it for use in your own projects.  \n",
    "- To save on API calls, we've embedded the text already, so you can load it from the saved file in the next cell.\n",
    "\n",
    "```Python\n",
    "# Encode the stack overflow data\n",
    "\n",
    "so_questions = so_database.input_text.tolist()\n",
    "question_embeddings = encode_text_to_embedding_batched(\n",
    "            sentences = so_questions,\n",
    "            api_calls_per_second = 20/60, \n",
    "            batch_size = 5)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8545a2ba-3d8c-4a61-9fd4-a9687df824e8",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "with open('question_embeddings_app.pkl', 'rb') as file:\n",
    "      \n",
    "    # Call load method to deserialze\n",
    "    question_embeddings = pickle.load(file)\n",
    "  \n",
    "    print(question_embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d11d36e-e70a-4981-a37b-6b978520d90f",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "so_database['embeddings'] = question_embeddings.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9b8f8e5-4989-47cc-a3b1-55e782dea2d0",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "so_database"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "024e3893-c558-4a35-8680-dc95cd70e417",
   "metadata": {},
   "source": [
    "## Semantic Search\n",
    "\n",
    "When a user asks a question, we can embed their query on the fly and search over all of the Stack Overflow question embeddings to find the most simliar datapoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e84c0523-8bb7-44f2-b437-28b7ca14428f",
   "metadata": {
    "height": 97
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.metrics import pairwise_distances_argmin as distances_argmin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f0a36b9-2929-4f83-bee7-2a6b82c30787",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "query = ['How to concat dataframes pandas']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96226ac6-0be7-4243-aede-df8226cfc1b2",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "query_embedding = embedding_model.get_embeddings(query)[0].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a2b5de8-eede-4cba-ae6e-5a56d14eb2b1",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "cos_sim_array = cosine_similarity([query_embedding],\n",
    "                                  list(so_database.embeddings.values))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f7f8418-93c4-416e-8df8-b15e96aa7bdf",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "cos_sim_array.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "582821c0-b31b-43ef-814e-5e5534b2d23c",
   "metadata": {},
   "source": [
    "Once we have a similarity value between our query embedding and each of the database embeddings, we can extract the index with the highest value. This embedding corresponds to the Stack Overflow post that is most similiar to the question \"How to concat dataframes pandas\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45c4a855-e9bc-4283-b93a-13a76fed5adc",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "index_doc_cosine = np.argmax(cos_sim_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b47bc5ad-c085-4b4e-9984-8bdb2afa05eb",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "index_doc_distances = distances_argmin([query_embedding], \n",
    "                                       list(so_database.embeddings.values))[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aa5ae70-9109-4ff6-b5bc-e0e587abe177",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "so_database.input_text[index_doc_cosine]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d957e90-c952-4d09-bca7-dbc7b9d3b2fb",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "so_database.output_text[index_doc_cosine]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7117a803-e2d0-460e-a431-2aeaf6c35a78",
   "metadata": {},
   "source": [
    "## Question answering with relevant context\n",
    "\n",
    "Now that we have found the most simliar Stack Overflow question, we can take the corresponding answer and use an LLM to produce a more conversational response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "224a6e5c-360d-4652-8082-a23a0e4eb445",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from vertexai.language_models import TextGenerationModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61d1378f-8d7c-41f1-be5a-957a2855ebd0",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "generation_model = TextGenerationModel.from_pretrained(\n",
    "    \"text-bison@001\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1069f11-18f2-4f35-ade2-08cbde27a6c9",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "context = \"Question: \" + so_database.input_text[index_doc_cosine] +\\\n",
    "\"\\n Answer: \" + so_database.output_text[index_doc_cosine]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f19e0aba-8d5f-4997-b754-73e32d54b27e",
   "metadata": {
    "height": 165
   },
   "outputs": [],
   "source": [
    "prompt = f\"\"\"Here is the context: {context}\n",
    "             Using the relevant information from the context,\n",
    "             provide an answer to the query: {query}.\"\n",
    "             If the context doesn't provide \\\n",
    "             any relevant information, \\\n",
    "             answer with \\\n",
    "             [I couldn't find a good match in the \\\n",
    "             document database for your query]\n",
    "             \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8460c33c-24db-4698-9e2c-509ff7629c73",
   "metadata": {
    "height": 148
   },
   "outputs": [],
   "source": [
    "from IPython.display import Markdown, display\n",
    "\n",
    "t_value = 0.2\n",
    "response = generation_model.predict(prompt = prompt,\n",
    "                                    temperature = t_value,\n",
    "                                    max_output_tokens = 1024)\n",
    "\n",
    "display(Markdown(response.text))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82a6ad63-2511-411b-b25f-33ddc2eb06fd",
   "metadata": {},
   "source": [
    "## When the documents don't provide useful information\n",
    "\n",
    "Our current workflow returns the most similar question from our embeddings database. But what do we do when that question isn't actually relevant when answering the user query? In other words, we don't have a good match in our database.\n",
    "\n",
    "In addition to providing a more conversational response, LLMs can help us handle these cases where the most similiar document isn't actually a reasonable answer to the user's query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2a14e45-63be-4f5d-978c-046de430be41",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "query = ['How to make the perfect lasagna']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "232818b5-281b-42df-a41a-a4a11e720113",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "query_embedding = embedding_model.get_embeddings(query)[0].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "000c1979-bb51-4dbd-921a-53d43b603a1b",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "cos_sim_array = cosine_similarity([query_embedding], \n",
    "                                  list(so_database.embeddings.values))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ccfded1-bf83-4d01-bdfa-a7dfd9f7eeee",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "cos_sim_array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9c1e7ff-8db2-4f21-a3d3-7fdd2084bfc1",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "index_doc = np.argmax(cos_sim_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b607d8-af01-4f0b-8ab6-53d9b7e9fba4",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "context = so_database.input_text[index_doc] + \\\n",
    "\"\\n Answer: \" + so_database.output_text[index_doc]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e5688d3-12ae-4472-8bb8-3e0496424258",
   "metadata": {
    "height": 148
   },
   "outputs": [],
   "source": [
    "prompt = f\"\"\"Here is the context: {context}\n",
    "             Using the relevant information from the context,\n",
    "             provide an answer to the query: {query}.\"\n",
    "             If the context doesn't provide \\\n",
    "             any relevant information, answer with \n",
    "             [I couldn't find a good match in the \\\n",
    "             document database for your query]\n",
    "             \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3685dd6a-7d08-4348-8ceb-7e1a640bfb2a",
   "metadata": {
    "height": 97
   },
   "outputs": [],
   "source": [
    "t_value = 0.2\n",
    "response = generation_model.predict(prompt = prompt,\n",
    "                                    temperature = t_value,\n",
    "                                    max_output_tokens = 1024)\n",
    "display(Markdown(response.text))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07ca1481-e353-46b0-8982-81db8eb23567",
   "metadata": {},
   "source": [
    "## Scale with approximate nearest neighbor search\n",
    "\n",
    "When dealing with a large dataset, computing the similarity between the query and each original embedded document in the database might be too expensive. Instead of doing that, you can use approximate nearest neighbor algorithms that find the most similar documents in a more efficient way.\n",
    "\n",
    "These algorithms usually work by creating an index for your data, and using that index to find the most similar documents for your queries. In this notebook, we will use ScaNN to demonstrate the benefits of efficient vector similarity search. First, you have to create an index for your embedded dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "320eee3a-4fdd-42e0-bfff-6c4504499447",
   "metadata": {
    "height": 148
   },
   "outputs": [],
   "source": [
    "import scann\n",
    "from utils import create_index\n",
    "\n",
    "#Create index using scann\n",
    "index = create_index(embedded_dataset = question_embeddings, \n",
    "                     num_leaves = 25,\n",
    "                     num_leaves_to_search = 10,\n",
    "                     training_sample_size = 2000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bdb20d1-6f09-4afc-9ce3-be927b48aa15",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "query = \"how to concat dataframes pandas\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5a3d624-b505-48a3-a639-ba976f57d920",
   "metadata": {
    "height": 216
   },
   "outputs": [],
   "source": [
    "import time \n",
    "\n",
    "start = time.time()\n",
    "query_embedding = embedding_model.get_embeddings([query])[0].values\n",
    "neighbors, distances = index.search(query_embedding, final_num_neighbors = 1)\n",
    "end = time.time()\n",
    "\n",
    "for id, dist in zip(neighbors, distances):\n",
    "    print(f\"[docid:{id}] [{dist}] -- {so_database.input_text[int(id)][:125]}...\")\n",
    "\n",
    "print(\"Latency (ms):\", 1000 * (end - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e44cec14-4001-4207-8ee2-03917482d94a",
   "metadata": {
    "height": 182
   },
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "query_embedding = embedding_model.get_embeddings([query])[0].values\n",
    "cos_sim_array = cosine_similarity([query_embedding], list(so_database.embeddings.values))\n",
    "index_doc = np.argmax(cos_sim_array)\n",
    "end = time.time()\n",
    "\n",
    "print(f\"[docid:{index_doc}] [{np.max(cos_sim_array)}] -- {so_database.input_text[int(index_doc)][:125]}...\")\n",
    "\n",
    "print(\"Latency (ms):\", 1000 * (end - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4c2f366-e8ac-4da7-8d40-c970d171b7f8",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
